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Abstract 

We introduce new methods for phylogenetic tree quartet construction by using ma- 
chine learning to optimize the power of phylogenetic invariants. Phylogenetic invariants 
are polynomials in the joint probabilities which vanish under a model of evolution on 
a phylogenetic tree. We give algorithms for selecting a good set of invariants and for 
learning a metric on this set of invariants which optimally distinguishes the differ- 
ent models. Our learning algorithms involve linear and semidefinite programming on 
data simulated over a wide range of parameters. We provide extensive tests of the 
learned metrics on simulated data from phylogenetic trees with four leaves under the 
Jukes-Cantor and Kimura 3-parameter models of DNA evolution. Our method greatly 
improves on other uses of invariants and is competitive with or better than neighbor- 
joining. In particular, we obtain metrics trained on trees with short internal branches 
which perform much better than neighbor joining on this region of parameter space. 

Keywords: Phylogenetic invariants, algebraic statistics, semidefinite programming, 
Felsenstein zone. 



1 Introduction 

Phylogenetic invariants have been used for tree construction since the linear invariants of 
Lake and Cavender- Felsenstein |Lak87| ICF87] were found. Although the linear invariants of 
the Jukes-Cantor model are powerful enough to asymptotically distinguish between trees on 
4 taxa |Lak87j . these linear invariants do not perform well in simulations |Hue95j . 

In the last 20 years, the entire set of phylogenetic invariants has been found for many 
models of evolution (see |AR07] and the references therein). Since the invariants provide an 
essentially complete description of the model, using more invariants should give more power 
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to distinguish between different models. However, different invariants give vastly different 
power at distinguishing models and it is not known how to find the most powerful invariants. 

In this paper, we use techniques from machine learning to find metrics on the space of 
invariants which optimize their tree reconstruction power for the Jukes-Cantor and Kimura 
3-parameter phylogenetic models on trees with four leaves. Specifically, we apply metric 
learning algorithms inspired by [XNJR03J to find the metric which best distinguishes the 
models. For training data we use simulations over a wide range of the parameter space. Our 
main biological result is the construction of a metric which outperforms neighbor-joining 
on trees simulated from the Felsenstein zone (i.e., trees with a short interior edge). More 
generally, we find that metric learning significantly improves upon other uses of invariants 
and is competitive with neighbor joining even for short sequences and homogeneous rates. 

Casanellas and Fernandez-Sanchez |CFS06j also used the Kimura 3-parameter invariants 
to construct trees with four taxa. Their results indicated that invariants can sometimes per- 
form better than commonly used methods (e.g., neighbor joining and maximum likelihood) 
for data that evolved with non-homogeneous rates and for extremely long sequences. They 
used the l\ norm on the space of invariants, weighing each polynomial equally. 

This paper improves upon [CFS06J by showing how to improve upon the l\ norm on the 
space of invariants. The l\ norm behaves poorly since it weighs equally informative and non- 
informative invariants. Simulating data and using metric learning improves the performance 
of invariants by putting much more weight on the powerful invariants. This allows us to 
build an algorithm which is very accurate for trees with short internal edges. 

We begin by briefly introducing the models and phylogenetic invariants we will use. 
Section[2]describes the metric learning algorithms; section[3]gives the results of our simulation 
studies; and we conclude with a short discussion. 

By phylogenetic tree, we mean a binary, unrooted tree with labelled leaves. There are 
three such trees with four leaves labelled 0, 1, 2, 3, we call these trees T 1; T 2 , and T 3 according 
to which leaf is on the same "cherry" as leaf 0. We consider two phylogenetic models on 
these trees: the Jukes-Cantor (JC69) model of evolution |JC69j and the Kimura 3-parameter 
(K81) model |Kim81j . both with uniform root distribution. 

These models associate to each edge e of the tree a transition matrix M e = e® te where 
where t e is the length of the edge e and Q is a rate matrix: 
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for JC69 and K81 respectively, where ■ = —7 — a — (3. For a given tree, we write pijki = 
Pr(leaf = i, leaf 1 = j, leaf 2 = k, leaf 3 = /) for i,j,k,l G {A,C,G,T} and write p = 
(Paaaa, • • • , Ptttt) for the joint probability distribution. Phylogenetic invariants are polyno- 
mial equations which are satisfied between the joint parameters. For example, Paaaa = Pcccc 
holds for both JC69 and K81, but since this equation is true for all three trees, will ignore 
it and similar equations. 
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Example 1. Consider the four-point condition on a tree metric |Bun74j. It says that if d is 
a tree metric on (ij : kl), then 

dij + dki < dik + dji = du + djk- (1) 
Given a probability distribution p, the maximum likelihood Jukes-Cantor distance is 

d ij = - l ] g\l--^.\ (2) 

where is the fraction of mismatches between the two sequences, e.g., 

^12 = Pwxyz- 

w,x,y,z£{A,C,G,T},w^=x 

After substituting in (JI]) and exponentiating, the equality becomes 

This observation is originally due to Cavender and Felsenstein [CF87] . The difference of the 
two sides of ^ is a quadratic polynomial in the joint probabilities which we will call the 
four-point polynomial. 



2 Methods 

Since both models we consider are group-based, it is easiest to work in Fourier coordinates 
which can roughly be thought of as the m^- coordinates in Example [I] (cf. |HP89l IES931 



ISS05] ). The website [http : //www . math . t amu . edu/~lgp/ small-trees/ contains lists of in- 



variants for different models on trees with a small number of taxa. 

Our first task is building a set of invariants for the two models. The above website shows 
33 polynomials (plus two implied linear relations) for JC69 and 8002 polynomials for K81. 
However, these sets of invariants are not closed under the symmetries of T\. That is, each 
tree can be written in the plane in eight different ways (for example, the tree T\ can be 
written as (01 : 23), (10 : 23), . . . , (32 : 10)), and each of these induces a different order on 
the probability coordinates Pijki- We need a set of invariants which does not change under 
this reordering if we don't want the resulting algorithm to depend on the order of the input 
sequences. 

After performing this calculation, we are left with 49 polynomials for JC69 and 11612 for 
K81. However, our metric learning algorithms run slowly as the number of invariants grows, 
so we had to find a subset of a more manageable size. We cut down the K81 invariants 
by testing each of the 11612 invariants individually on the entire parameter space and only 
keeping those which had good individual reconstruction rates. Specifically, we picked several 
different values for 7, a, (3 and kept only those invariants which gave over a 62% reconstruc- 
tion rate individually for sequences of length 100. The result of this calculation is sets of 
invariants f: JC69 and fj- 1 of cardinality 49 and 52. 
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Figure 1: The tree used in the simulations. Branch lengths a and b ranged from 0.01 to 0.75 
in intervals of 0.02. 

Definition 2. Given a probability distribution p, and invariants f, = . . . , fi, n ), for tree 



be the point in M. n obtained by evaluating the invariants for Ti at p. 

If the probability distribution p actually comes from the model Ti, then we will have 
= 0; ^(p) 7^ 0, and f 3 (p) ^ generically (that is, except for points p which lie on the 
intersection of two or more models). This fact suggests that we can just pick the tree % 
such that fj is closest to zero. However, the next example shows that it is quite important 
to pick good polynomials and weigh them properly. 

Example 3. Figure [2] shows the distribution of four of the invariants from ff c on data from 
simulations of 1000 i.i.d. draws from the Jukes-Cantor model on T\ over a varying set of 
parameters. The histograms show the distributions for the simulated tree (Ti) in yellow and 
the distributions for the other trees in gray and black. 

Polynomial 10 (upper left) distinguishes nicely between the three trees with the correct 
tree tightly distributed around zero. It is correct 97% of the time on our space of trees 
(Figure [I]). Polynomial 48 (upper right) also shows power to distinguish between all three 
trees, but the distributions are much more overlapping — it is only correct 50.8% of the 
time. Polynomial 10 is the four-point invariant from Example [TJ polynomial 48 is one of 
Lake's linear invariants. The two other examples show a polynomial (23) which is biased 
towards selecting the wrong tree (only 16% correct), and a polynomial (45) for which the 
correct tree is tightly clustered around zero, but the incorrect trees are indistinguishable and 
have wide variance (88.9% correct). 

The parameters used for the simulations are described in Figure [TJ Since 1000 samples 
should be quite enough to determine the structure of a tree on four taxa, it is revealing 
that many of the individual polynomials are quite poor (the mean prediction rate for all 49 
polynomials is only 42%). The invariants have quite different variances and means and it is 
not optimal to take each one with equal weight. 

This example shows that we need to scale and weigh the individual invariants. Recall 
that for a positive (semi)definite matrix A the Mahalanobis (semi)norm || • is defined by 



Ti (fori = 1,2,3;, let 
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Figure 2: Distributions of four polynomials fi,io, fi^s, ftps an d /j,45 on simulated data. The 
yellow histogram corresponds to the correct tree, the black and gray are the other two trees. 



Notice that since A is positive semidefinite, it can be written as A = UDU 1 where U is 
orthogonal and D is diagonal with non-negative entries. Thus the square root B = U \[DU t 
is unique. Now since ||x||^ = x l Ax = {Bx) t {Bx) = \\Bx\\ 2 , we can view learning such a 
metric as finding a transformation of the space of invariants that replaces each point x with 
Bx under the Euclidean norm. Accordingly, we will be searching for a positive semidefinite 
matrix A on the space of invariants which is "optimal" . 

Let p(9) be an empirical probability distribution generated from a phylogenetic model 
on tree T\ with parameters 9. We wish to find A such that the condition 

||f 1 (^)))|U<min(||f 2 (^))|U,||f 3 (^))|U) 

is typically true for most p(9) chosen from a suitable parameter space 0. 

Now suppose that is a finite set of parameters from which we generate training data 
fi(p(0)), f 2 (p(0)), fs(p(9)) for 9 G 0. As we saw above, each of the eight possible ways of 
writing each tree induces a signed permutation of the coordinates of each fj(p(0)). We write 
these permutations in matrix form as 7Ti, . . . , 7Tg. Given this training data, we wish to solve 
the following optimization problem. 

Algorithm 4. (Metric learning for invariants) 

Input: model invariants fj for Tj and a finite set of model parameters. 

Output: a semidefinite matrix A 

Procedure: 

1. Reduce the sets fj of invariants to a manageable size by testing individual invariants 
on data simulated from 9 G 0. 
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2. Augment the resulting sets so that they are closed under the eight permutations of the 
input which fix tree T\. 

3. Compute the signed permutations 7i"i, . . . , its which are induced on the invariants fi by 
the above permutations. 

4. Solve the following semidefinite programming problem: 

Minimize: J^eee + Atr ^ 

Subject to: ||X 1 (^))||i + 7<min(||X 2 (^))||i,||X 3 (^))||i)+^), 
Hi A = Aiii for 1 < i < 8, 
A y 0, and 

m > o, 

where A y denotes that A is a positive semidefinite matrix. 

5. Alternatively, if we restrict A to be diagonal, this becomes a linear program and can 
be solved for much larger sets of invariants and parameters. 

In the optimization step, we use a regularization parameter A to keep A small and a 
margin parameter 7 to increase the margin between the distributions. This is a convex opti- 
mization problem with a linear objective function and linear matrix equality and inequality 
constraints. Hence it is a semidefinite programming (SDP) problem. The SDP problem 
above has a unique optimizer and can be solved in polynomial time. Its complexity depends 
on the capacity of the set G since each point in contributes a linear constraint. 

For the range of parameters we consider in Section |3j we use #(©) = 1444. In our 
experiments to solve the SDP we use SeDuMi 1.1 |Stu99j or DSDP5 |BY05] with Y ALMIP 
[loI] as the parser. Matlab code to implement the above algorithm can be found at http : 
|//math. Stanford. edu/~y uany/metricPhylo/matlab/ 

We found that although SeDuMi often runs into numerical issues, it generally finds a 
good matrix A with competitive performance to the neighbor-joining algorithm. DSDP is 
better in dealing with numerical stability at the cost of more computational time. We have 
found that setting A = 0.0001 and 7 = 0.005 gives good results in our situation. For example, 
in the case of the JC69 model with a 49 x 49 semidefinite matrix A, YALMIP-SeDuMi takes 
55.7 minutes to parse the constraints and solve the SDP, while YALMIP-DSDP takes 167.1 
minutes to finish the same job. For details on experiments, see the next section. 

Our algorithm was inspired by some early results on metric learning algorithms such as 
[XNJR03] and [SSS N04j . which aim to find a (pseudo)-metric such that the mutual distances 
between similar examples are minimized while the distances across dissimilar examples or 
classes are kept large. Direct application of such an algorithm is not quite suitable in our 
setting. As shown in Figure |2j the correct tree points are overlapped by the two incorrect 
trees. For points in the overlapping region, it is hard to tell whether to shrink or stretch 
their mutual distance. However, when the points appear in triples, it is possible that for 
each triple the one closest to zero is generated from the correct tree. Our algorithm is based 
on such an intuition and proved successful in experiments. 
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After using Algorithm [4] to find a good metric A, the following simple algorithm allows 
us to construct trees on four taxa. 

Algorithm 5. (Tree construction with invariants) 

Input: A multiple alignment of 4 species and a semidefinite matrix A from Algorithm |4j 

Output: A phylogenetic tree on the 4 species (without branch lengths). 

Procedure: 

1. Form empirical distributions p by counting columns of the alignment. 

2. Form the vectors fj = (/j,i(p), . . . , fi,n(P)) for 1 < 2 < 3. 

3. Return Tj where the vector fj has smallest A-norm \\fi\\A — \/^\A^i. 

3 Results 

We tested our metric learning algorithms for the invariants f Jcm and f^ 81 as described 
above. We trained two metrics for JC69 on the tree in Figure [TJ The first used simulations 
from branch lengths between 0.01 and 0.75 on a grid with increments of magnitude 0.02, for 
a total of 1444 different parameters. This region of parameter space was chosen for direct 
comparison with |Hue95t ICFS06) . The second metric used parameters 0.01 < a < 0.25 and 
0.51 < b < 0.75 with increments of 0.01, giving 625 trees with very short interior edges. 
Similarly for K81, we learned a metric using parameters 0.01 < a,b < 0.75 and several 
sets of 7, a, (3. After learning metrics, we performed simulation tests on the same space 
of parameters that was used to train. We compared neighbor-joining [SN87j . phylogenetic 
invariants with the l\ or li norm and phylogenetic invariants with our learned norms. 

Since the edge lengths are large for part of the parameter space, we often see simulated 
alignments with more than 75% mismatches between pairs of taxa. In such a case, (|2]) 
returns infinite distance estimates under the Jukes-Cantor model. So the results depend on 
how neighbor joining treats infinite distances. In PHYLIP |Fel04], the program dnadist 
doesn't return a distance matrix if some distances are infinite. However, in PAUP*, infinite 
distances are set to a fixed large number. Since we are only concerned with the tree topology, 
we believe that the most fair comparison is between phylogenetic invariants and the method 
of PAUP*. However, it should be noted that this can make a major difference in results 
using neighbor joining, since often the correct tree can be returned even if some distances 
are infinite. See Figure [3] for an example of the difference and be warned that comparison 
between simulations studies done in different ways is difficult. 

Table [TJ shows the results of 100 simulations at each of the 1444 parameter values for 
various sequence lengths using the JC69 model. It gives the percent correct over all 144,400 
trials for five different methods: invariants with l\, I2, and ^4-norms and neighbor joining 
(using Jukes-Cantor distances and allowing infinite distances). The contour plots in Figure [3] 
show how the reconstruction rates vary across parameter space for the five methods for a 
sequence length of 100. Notice that the A-norm shows particularly good behavior over the 
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Table 1: Percent of trials reconstructed correctly for the Jukes-Cantor model over the entire 
parameter space and the Felsenstein zone for the respective metrics. 
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Table 2: Percent of trials reconstructed correctly for the Kimura 3-parameter model over 
the entire parameter space for two choices of 7, a, (3. 

entire range of parameters, even in the "Felsenstein zone" in the upper left corner. When 
trained on the Felsenstein zone, the learned metric can perform even better. Table [T] shows 
the result of training a metric on this zone. Notice that the A-norm is now quite a bit better 
than neighbor joining, even though the l\ and I2 norms are terrible. However, this learned 
norm is slightly worse on the whole parameter space than the metric trained on the whole 
space. 

Table [2] shows results for the K81 model under two choices of (7,0,/?). We only report 
the I2 scores, since the l\ scores are similar. Of note is the column "Z2 restrict" which shows 
the I2 norm on the top 52 invariants as ranked by individual power on simulations as in the 
previous section. This column is better than the I2 norm on all 11612 invariants, showing that 
many invariants are actually harmful. The A-norm again improves on even the restricted I2 
and beats neighbor-joining (run with K81 distances) on all examples. 
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Figure 3: Contour plots for the three reconstruction methods for the Jukes-Cantor model 
over parameter space with alignments of length 100. Black areas correspond to parameters 
(a, b) for which the tree was reconstructed correctly over 95% of the time, gray for over 50%, 
light gray for over 33%, and white for under 33%. 
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4 Discussion 



We have shown that machine learning algorithms can substantially improve the tree con- 
struction performance of phylogenetic invariants. As an example, for sequences of length 
100, the four-point invariant (Example [T]) for the K81 model is correct 82% of the time on 
data simulated from K81 with parameters (0.1,3.0,0.5). This is quite a bit better than the 
li norm on all 11612 invariants (76.8%, Table [2]). 

The paper [CFS07J describes an algebraic method for picking a subset of invariants for 
the K81 model. They reduce to 48 invariants which give an improvement over all 11612 
invariants (up to 82.6% on the above example using the li norm). However, of these 48, only 
4 of them are among the top 52 we selected for f K81 , and the remaining 44 invariants are 
mostly quite poor (42% average accuracy). After taking the closure of these 48 invariants, 
there are 156 total and the performance actually drops to 78.3%. It seems that the conditions 
for an invariant to be powerful are not particularly related to the algebraic criterion used in 
[CF507] . 

All invariant based methods heavily depend on the set of invariants that we begin with. 
Learning diagonal matrices A had mixed performance, which further suggests that the gen- 
erating set we are using for the invariants is non-optimal. We believe that it is an important 
mathematical problem to understand what properties are shared by the good invariants. We 
suggest that symmetry might be an important criterion to construct other polynomials like 
the four-point condition with good power. 

The learned metrics in this paper are somewhat dependant on the parameters chosen 
to train them. This can be a benefit, as it allows us to train tree construction algorithms 
for specific regions of parameter space (e.g., the Felsenstein zone). However, we hope that 
improvements to the metric programming will allow us to train on larger parameter sets and 
thus obtain uniformly better algorithms. 

Notice that these methods only recover the tree topology, not the edge lengths. We 
believe that if the edge lengths are needed, they should be estimated after building the tree, 
in which case standard statistical methods such as maximum likelihood can be used easily. 
While the invariants discussed in this paper may not be practical for large trees, we believe 
there is great use in understanding fully the problem of building trees on four taxa. For 
example, these methods can either be used as an input to quartet-based tree construction 
algorithms or as a verification step for larger phylogenetic trees. 

For a method of building trees on more than four taxa using phylogenetic invariants, see 
[Eri05, KKPP06J, which use numerical linear algebra to evaluate invariants given by rank 
conditions on certain matrices in order to construct phylogenetic trees. This amounts to 
evaluating many polynomials at once, allowing it to run in polynomial time. 

The matrices A for JC69 and K81 used in the tests can be found at htt p: //stanf ord7| 
edu/~nke/data/metricPhylo, A software package that can run these tests is available at 
the same website. It includes a program for simulating evolution using any Markov model 
on a tree and several programs using phylogenetic invariants. 
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